JAMIA Open

Oxford University Press (OUP)

Preprints posted in the last 90 days, ranked by how well they match JAMIA Open's content profile, based on 37 papers previously published here. The average preprint has a 0.06% match score for this journal, so anything above that is already an above-average fit.

1
Stigmatizing Language Detection in Opioid Use Disorder Patient-Directed Discharge Clinical Documentation: A Privacy-Preserving Analysis Using a Locally Deployed Large Language Model

Izzo, J. A.; McIntyre, A. M.; Nguyen, J.; Bashaw, D.; Torrance, C. A.; Foster, J.

2026-06-01 health informatics 10.64898/2026.05.29.26354402 medRxiv
Top 0.1%
14.1%

Objective: Stigmatizing language in the electronic health record (EHR) has been associated with adverse patient experience in substance use disorder care, including opioid use disorder (OUD). This study evaluated a privacy-preserving, locally deployed large language model (LLM) as a method to detect stigmatizing language in the clinical documentation of OUD patients with patient-directed discharge (PDD). Methods: In a retrospective cohort study, 477 inpatient admissions from the MIMIC-IV database with a diagnosis of opioid use disorder were classified using a locally deployed Gemma-4-31b-it-bf16 model and a predefined 140-term lexicon to identify stigmatizing language in clinical documentation. Results: Stigmatizing language was present in 84.1% (190/226) of the PDD cohort vs 62.2% (156/251) of the non-PDD cohort, an unadjusted odds ratio of 3.21 (95% CI 2.07-4.98; p < 0.0001). After adjustment for age, sex, insurance status, marital status, and race, PDD remained an independent predictor of stigmatizing documentation (aOR 2.24, 95% CI 1.40-3.59; p < 0.0001). Further analysis of stigma intensity showed more stigmatizing markers in the PDD cohort than in the non-PDD cohort (2.85 ± 2.39 vs 2.02 ± 2.44; p < 0.0001). Discussion and Conclusion: Stigmatizing language is detected with greater frequency and prevalence in the clinical documentation of OUD patients who initiate PDD than in that of patients who follow standard discharge processes. A locally deployed LLM offers a scalable, privacy-preserving method to audit clinical documentation for stigmatizing language.
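The unadjusted odds ratio in this abstract can be checked from the reported counts. The sketch below assumes a standard Wald interval on the log odds ratio; the abstract does not say which interval method the authors used, and the function name is illustrative.

```python
import math

def odds_ratio_wald(a, b, c, d, z=1.96):
    """Unadjusted odds ratio and Wald 95% CI for a 2x2 table.

    a = group 1 with outcome, b = group 1 without outcome,
    c = group 2 with outcome, d = group 2 without outcome.
    """
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Counts from the abstract: 190 of 226 PDD and 156 of 251 non-PDD admissions flagged.
or_, lo, hi = odds_ratio_wald(190, 226 - 190, 156, 251 - 156)
print(f"OR {or_:.2f} (95% CI {lo:.2f}-{hi:.2f})")  # prints OR 3.21 (95% CI 2.07-4.98)
```

Under that assumption the result matches the reported 3.21 (2.07-4.98). The adjusted odds ratio cannot be reproduced this way because it depends on patient-level covariates.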

2
SmartAlert: Integrating Machine Learning and Alert Triggers into Live Electronic Medical Record Systems, Targeting Low-Yield Inpatient Lab Tests

Jiang, Y.; Ma, S.; Liang, A.; Kim, G.; Acharya, A.; Mony, S.; Punnathanam, S.; Makeown, J.; Jose, J.; Shieh, L.; Pham, T.; Ng, A. Y.; Chen, J. H.

2026-05-06 health informatics 10.64898/2026.04.29.26351965 medRxiv
Top 0.1%
14.1%

This study explores integrating machine learning into electronic medical record systems to predict stability of inpatient lab tests. A smart alerts system was developed and tested at Stanford Hospital. The system identifies stable lab results, advising clinicians on test ordering. Live deployment showed desired precision at good recall in predicting test result stability, with suggestions for system optimization identified. This approach may significantly decrease low-yield testing and enhance personalized clinical decision-making.

3
A bibliometric review of explainable AI in diabetes risk prediction: Trends, gaps, and knowledge graph opportunities

Van, T. A.

2026-04-20 health informatics 10.64898/2026.04.16.26351069 medRxiv
Top 0.1%
13.8%

Background: Type 2 diabetes mellitus (T2DM) is a leading global public health challenge. Machine learning (ML) combined with Explainable AI (XAI) is increasingly applied to T2DM risk prediction, but the field lacks a quantitative overview of methodological trends and integration gaps. Methods: We present a structured synthesis and critical analysis of the XAI literature on T2DM risk prediction, combining (i) quantitative bibliometric analysis of a two-database corpus (N = 2,048 documents from Scopus and PubMed/MEDLINE, deduplicated via a transparent three-tier pipeline) and (ii) an in-depth selective review of 15 highly cited papers. Reporting follows PRISMA 2020, adapted for metadata-based synthesis; analyses include keyword frequency, rule-based thematic clustering, and publication trend analysis. Results: The field grew rapidly, from 36 documents (2020) to 866 (2025). SHAP and LIME dominate XAI methods; XGBoost and Random Forest dominate ML models. Critically, KG/GNN terms appeared in only 17 documents (~0.83%) compared with 906 for XAI methods, a 53.3:1 disparity. This gap is consistent across both databases, which share 33.2% of their records, ruling out a single-database artifact. The selective review confirmed that none of the 15 highly cited papers combined all three components, ML, XAI, and KG, in T2DM risk prediction. Conclusions: The XAI for T2DM risk prediction field exhibits a clinical interpretability gap: statistical explanations are rarely linked to structured clinical pathways. We propose a three-layer conceptual framework (Predictive → Explainability → Knowledge) that integrates KG as a supplementary semantic layer, with potential applications in clinical decision support and population-level screening. The framework does not perform true causal inference but structures explanations around established pathophysiological knowledge. This study contributes a transferable methodology and a quantified research gap to guide future work integrating ML, XAI, and structured medical knowledge.

4
Research Paper on AuditMed: A Single-File, Browser-Based Clinical Evidence Audit Platform Architecture, Current Capabilities, and Proposed Applications in Drug Informatics and Pharmacy Education

Ferguson, D. J.

2026-04-20 health informatics 10.64898/2026.04.19.26351188 medRxiv
Top 0.1%
10.5%

Background: Clinical pharmacists, trainees, and educators increasingly rely on multi-database literature retrieval and structured evidence synthesis to answer drug-information questions and support therapeutic decisions. Commercial integrated platforms exist but remain inaccessible to many learners in community, rural, and international training contexts. Objective: This paper describes the architecture and current capabilities of AuditMed, a single-file, browser-based clinical evidence audit platform, and reports the findings of an initial internal validation exercise in which the platform's one-click Clinical Inference workflow was applied to a pre-authored compendium of ten complex clinical cases (Cases 30-39) to produce synthetic data outputs, which are simulated for the deterministic engine. Methods: AuditMed integrates nineteen free, publicly available clinical and biomedical APIs into a six-stage Search → Select → Parse → Analyze → Infer → Create pipeline and produces nine structured output formats. A complex clinical case compendium (10 pre-authored cases spanning multi-organ disease, oncology, immunology, and hematology) was processed through the one-click inference workflow. Each output was reviewed against the case's documented diagnoses, lab values, medications, and pharmacy-boards-level teaching points to identify correctly flagged findings, missed findings, and structural artifacts. Results: Across ten cases, the platform correctly surfaced six high-yield teaching points: a three-agent serotonin-syndrome combination (Case 30), dual hepatotoxicity (Case 31), an amiodarone-digoxin P-gp interaction (Case 34), voriconazole CPIC-A pharmacogenomics (Case 36), phenytoin HLA-B*15:02 screening (Case 38), and warfarin CYP2C9/VKORC1 dosing guidance (Case 39). Four structural problems recurred: (i) a 50-layer polyroot reasoning chain was identical across all cases regardless of clinical seed; (ii) an input-parsing error mapped patient age to body temperature, producing physiologically impossible fevers of 41°C to 70°C in six cases; (iii) the diagnosis field in Case 38 was corrupted by a medication-dosing instruction that propagated through the entire report; and (iv) the embedded drug-drug interaction matrix was sparse, missing clinically significant pairs including ciprofloxacin-warfarin, voriconazole-cyclosporine, amphotericin-aminoglycoside nephrotoxicity stacking, and prednisone-warfarin. Discussion: AuditMed's pharmacogenomic and selected drug-drug interaction modules produced clinically meaningful outputs when they fired, and the platform's explicit verification-step architecture is a structural strength. The recurrent failure modes (a boilerplate polyroot chain, input parsing without sanity validation, and sparse DDI coverage) are amenable to short, high-yield engineering fixes and do not require re-architecting the pipeline. Case-level clinical question framing also suffers from a relevance-ranking bias toward medications in list order rather than by clinical acuity. Conclusion: AuditMed demonstrates proof-of-concept for a free, transparent, auditable multi-source evidence synthesis workflow suitable for pharmacy education and drug-informatics training. The internal validation exercise reported here identifies four specific, tractable engineering priorities whose resolution is required before the platform can be responsibly recommended beyond research and educational use. Formal external validation against retrieval and synthesis benchmarks remains planned.
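The age-to-temperature failure described above is the kind of error a plausibility check on parsed fields can catch. The following is a minimal sketch only: the field names, the range limits, and the cross-field rule are hypothetical illustrations, not taken from AuditMed and not clinical guidance.

```python
# Hypothetical plausibility ranges, chosen only for illustration.
PLAUSIBLE_RANGES = {
    "age_years": (0, 120),
    "temperature_c": (30.0, 43.0),
}

def validate_vitals(record):
    """Return a list of (field, reason) pairs for values that look implausible."""
    problems = []
    for field, (low, high) in PLAUSIBLE_RANGES.items():
        value = record.get(field)
        if value is not None and not (low <= value <= high):
            problems.append((field, "out of range"))
    # Cross-field check: a temperature identical to the age suggests a mis-mapped column.
    age, temp = record.get("age_years"), record.get("temperature_c")
    if age is not None and temp is not None and age == temp:
        problems.append(("temperature_c", "equals age"))
    return problems

# A record where the age (70) was also written into the temperature field.
print(validate_vitals({"age_years": 70, "temperature_c": 70.0}))
```

A range check alone would pass a value such as 41°C, which is why the sketch also compares the two fields directly.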

5
Evaluating Large Language Models for Translating Multimodal Phenotype Documentations into Executable EHR Phenotyping Algorithms

Yan, C.; Xin, Y.; Su, W.-C.; Gangireddy, S.; Durbhakula, S.; Bruehl, S. P.; Dickson, A. L.; Li, L.; Feng, Q.; Malin, B. A.; Derr, T.; Wei, W.-Q.

2026-05-22 health informatics 10.64898/2026.05.20.26353690 medRxiv
Top 0.1%
10.4%

Research applications of electronic health record (EHR) phenotypes require translating clinical definitions into executable EHR database queries, a labor-intensive process. We evaluated two frontier large language models across five phenotypes and three documentation modalities. Both models captured high-level logic from structured text but degraded markedly with diagram-only input. Error analysis revealed seven failure categories. Documentation, rather than model capability, was the primary bottleneck, reinforcing the need for standardization and expert oversight.

6
The Associations of Diabetes Mellitus and Obesity on Osteoporosis

Thomas, M. G.; Jayasuriya, A. C.

2026-03-18 orthopedics 10.64898/2026.03.16.26348517 medRxiv
Top 0.1%
10.3%

Objectives: Osteoporosis is a common and debilitating condition that disproportionately affects older adults, particularly women, leading to increased fracture risk and reduced quality of life. While traditional risk factors such as age, hormonal changes, and lifestyle are well established, the impacts of diabetes and obesity on osteoporosis remain unclear. This study aimed to investigate associations between diabetes, obesity, and osteoporosis diagnosis in Caucasian women aged 64 years and older. Materials and Methods: Data on osteoporosis diagnosis, diabetes diagnosis, and body mass index (BMI) were obtained from the publicly available Study of Osteoporotic Fractures (SOF) database. Statistical analyses were conducted using IBM SPSS software. Associations between diabetes, BMI, and osteoporosis were evaluated at two study visits (visit 1 and visit 8). Analysis of variance (ANOVA) and correlation analyses were used to assess relationships among variables. Results: No significant association was found between diabetes and osteoporosis at visit 1 (p = 0.966); however, a statistically significant association emerged at visit 8 (p < 0.001). A weak negative correlation between diabetes and osteoporosis was observed at visit 8 (r = -0.068, p < 0.001), indicating that participants with diabetes were slightly less likely to be diagnosed with osteoporosis. BMI category was significantly associated with osteoporosis at both visits (p < 0.001). Post hoc analyses revealed that overweight and obese women had a lower likelihood of osteoporosis than underweight or normal-weight participants. Conclusions: Diabetes showed no consistent association with osteoporosis diagnosis, whereas higher BMI appeared to exert a protective effect against osteoporosis in older women.

7
Nationwide Prediction of Missed and Cancelled Appointments Using Real-World EHR Data

Miran, S. A.; Cheng, Y.; Faselis, C.; Brandt, C.; Vasaitis, S.; Nesbitt, L.; Zanin, L.; Tekle, S.; Ahmed, A.; Nelson, S. J.; Zeng-Treitler, Q.

2026-04-13 health informatics 10.64898/2026.04.08.26349942 medRxiv
Top 0.1%
10.3%

Objectives: To develop and evaluate predictive models for unused outpatient appointments (missed or cancelled) using a large national electronic health record (EHR) repository in the United States. Design: Retrospective observational study using machine learning and statistical modeling. Setting: A U.S. national electronic health record repository (Cerner Real World Database) covering healthcare encounters from 2010 to 2025. Participants: Adult patients aged ≥18 years with routine outpatient encounters recorded in the database. One outpatient appointment with a known status was randomly selected per patient, resulting in a final analytic sample of 5,699,861 encounters. Primary and Secondary Outcome Measures: The primary outcome was whether the index outpatient appointment was attended or unused (missed or cancelled). Model performance was evaluated using area under the receiver operating characteristic curve (AUC), sensitivity, and specificity. Methods: Predictors included patient characteristics (demographics and insurance type), appointment characteristics (day, time, season, and urbanicity), prior cancellation rate, and time gap between the index appointment and the previous visit. We compared the predictive performance of two machine learning models (random forest classifier and extreme gradient boosting (XGBoost)) with logistic regression. An explainable AI analysis of feature impact was performed on the final XGBoost model. Results: Among 5,699,861 outpatient encounters, 3,650,715 (64.0%) were attended and 2,049,146 (36.0%) were unused. XGBoost achieved the best predictive performance on the test dataset (AUC = 0.95), followed by random forest (AUC = 0.92) and logistic regression (AUC = 0.89). Feature impact score analysis revealed highly non-linear associations between predictors and the risk of unused appointments at the individual level. Conclusions: Unused outpatient appointments can be accurately predicted using routinely available EHR data. Integrating predictive models into scheduling workflows may improve healthcare efficiency and optimize appointment management. Article Summary: Strengths and limitations of this study: This study used one of the largest national electronic health record datasets to develop predictive models for unused outpatient appointments. Multiple modeling approaches, including logistic regression and machine learning methods (random forest and XGBoost), were compared to evaluate predictive performance. An explainable artificial intelligence method was applied to quantify feature impact and improve model interpretability. The retrospective design and reliance on routinely collected EHR data may introduce data quality limitations and unmeasured confounding. The database did not distinguish clearly between cancelled appointments and no-shows.

8
Development of Explainable Machine Learning Framework for Early Detection and Risk Stratification of Diabetes in Age Specific Variations

Lukhele, N.; Mostafa, F.

2026-04-27 health informatics 10.64898/2026.04.25.26351733 medRxiv
Top 0.1%
10.2%

Objective: To develop and evaluate a novel machine learning (ML) framework tailored to a clinical diabetes dataset and to assess whether demographic stratification enhances model performance and interpretability for multiclass diabetes classification. Methods: A clinical dataset of 264 patient records was used to classify individuals into non-diabetic, prediabetic, and diabetic categories. Several supervised learning models were trained using an 80:20 train-test split and optimized using RandomizedSearchCV with 10-fold cross-validation. Model performance was evaluated using accuracy, precision, recall, and F1-score. Area under the receiver operating characteristic curve (AUC) was calculated for the best-generalizing model. A structured ML framework was developed for this dataset, incorporating preprocessing, model optimization, and stratified analysis by age (<35 vs ≥35 years) and gender. SHAP was applied for model interpretability. Results: Ensemble methods outperformed linear and single-tree approaches, with Gradient Boosting showing the most stable generalization (test accuracy 0.981, cross-validation accuracy 0.972). AUC-ROC analysis using Gradient Boosting yielded good discriminative ability across the three diabetes classes: 0.991 (non-diabetic), 0.986 (prediabetic), and 0.972 (diabetic). Stratified analysis showed improved reliability in individuals aged ≥35 years (accuracy = 0.94, F1-score = 0.92), while performance in younger individuals was unstable due to small sample size. SHAP analysis identified HbA1c, BMI, and age as dominant predictors. Conclusion: This study presents an ML framework that integrates age-stratified modelling with explainable ML methods to improve interpretability. The findings offer clinically relevant results that can support clinical decision-making systems, individualized risk assessment, and potential applications for targeted intervention in diabetes progression.

9
Machine Learning Estimation of Gestational Age at Delivery Using Linked Mother-Infant Electronic Health Records Across Two Health Systems

Bejan, C. A.; Yang, X.; Pham, A.; Qassem, L.; Abraham, A. A.; Choi, L.; Rosenbloom, S. T.; Gamire, L. X.; Phillips, E. J.

2026-05-25 obstetrics and gynecology 10.64898/2026.05.23.26353959 medRxiv
Top 0.1%
10.1%

Objective: This study aimed to train and evaluate supervised machine learning algorithms using electronic health record (EHR) data to accurately estimate gestational age at delivery. Materials and Methods: We trained random forest, gradient boosting, and ensemble models on EHR data of mother-infant dyads from Vanderbilt University Medical Center (VUMC) and replicated the analyses at University of Michigan (UMich). We further analyzed EHR predictors of gestational age, assessed temporal drift in EHR data elements, and evaluated model performance stratified by delivery status. Results: The study included pregnancies corresponding to 54,344 and 34,345 mother-infant dyads at VUMC (2005-2025) and UMich (2012-2024), respectively. The gestational age predictions of the ensemble models achieved the highest agreement with the reference standard on the VUMC dataset (±1 week: 85.2%, ±2 weeks: 94.3%, MAE: 4.4 days) and demonstrated stronger generalization on the UMich dataset (±1 week: 93.1%, ±2 weeks: 97.8%, MAE: 2.8 days). Further, performance was better among pregnancies delivered in more recent years, and among full- and late-term deliveries compared with preterm deliveries. Discussion: The results indicate that supervised machine learning methods leveraging linked mother-infant EHRs can accurately estimate gestational age at delivery, while demonstrating the generalizability of the modeling approach and the portability of the analytic workflow across healthcare sites. Conclusion: This study presents a robust and generalizable machine learning framework to estimate gestational age at delivery. The framework can be reliably used to impute gestational age in large-scale, real-world clinical studies to support maternal and neonatal health research, in which accurate estimation of pregnancy onset is critical.
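The agreement figures quoted in this abstract (share of predictions within ±1 and ±2 weeks, plus mean absolute error) can be computed as in the sketch below. It assumes "within ±1 week" means an absolute error of at most 7 days, which the abstract does not specify, and the gestational ages are made-up values.

```python
def agreement_metrics(predicted_days, reference_days):
    """Share of predictions within 1 and 2 weeks of the reference, plus MAE in days."""
    errors = [abs(p - r) for p, r in zip(predicted_days, reference_days)]
    n = len(errors)
    return {
        "within_1_week": sum(e <= 7 for e in errors) / n,
        "within_2_weeks": sum(e <= 14 for e in errors) / n,
        "mae_days": sum(errors) / n,
    }

# Invented gestational ages in days, for illustration only.
predicted = [273, 266, 250, 281]
reference = [275, 270, 266, 280]
print(agreement_metrics(predicted, reference))
```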

10
Shared Strides: Community-based, high-throughput biomechanics data collection in knee osteoarthritis

Qualter, J. M.; McCloskey, R. C.; Stofer, K. A.; Qiu, P.; Tian, Z.; Vincent, H. K.; Costello, K. E.

2026-03-25 orthopedics 10.64898/2026.03.23.26349064 medRxiv
Top 0.1%
9.9%

Objective: This analysis assessed the acceptability and recruitment implications of a high-throughput, community-based biomechanics protocol among individuals with knee osteoarthritis (OA). Design: During the Shared Strides Study, high-throughput markerless biomechanics assessment was conducted at community sites to help facilitate research engagement in the OA population. In this cross-sectional study, biomechanics data during a set of activities of daily living (ADLs) and questionnaire data were collected. Adults aged 40 years or older with knee OA participated at one of four sites across Gainesville, FL--two on-campus and two community-based. Eligible individuals were either screened over the phone and scheduled for a specific date and time or screened on site for potential same-day participation. Participant acceptability of the community-based biomechanics data collection approach was assessed using a 15-item custom questionnaire. Recruitment characteristics and participant preferences were compared across sites. Results: The high-throughput community-based data collection approach was well received. Compared with on-campus sites, community-based sites had higher engagement from walk-in participants and new research participants (40% of the sample). Familiarity with, and distance to, a data collection site were important factors in research engagement in this population. No differences in demographic characteristics existed between sites (p > 0.05), but recruitment resulted in a large sample size (n = 85) likely representative of the communities surrounding the selected sites. Conclusions: Integrating markerless motion capture with a community-based research approach may enhance the participant experience and facilitate larger, more heterogeneous sample sizes, ultimately reducing bias and homogeneity in current OA biomechanics research.

11
Professionalism Pulse: Development and Validation of a Natural Language Processing Pipeline and Dashboard for Safety Culture Surveillance in NYC Health + Hospitals

Mangut, E.; Wallace, R.

2026-05-22 health informatics 10.64898/2026.05.19.26353620 medRxiv
Top 0.1%
9.8%

Background: Professionalism and effective communication are foundational determinants of patient safety and quality of care. Unprofessional behaviors frequently serve as active precursors to adverse clinical events. However, proactive organizational surveillance is often hindered because incident feedback exists primarily as unstructured, free-text data. This study aimed to develop and validate a Natural Language Processing (NLP) pipeline and interactive dashboard to proactively monitor the "professionalism climate" within NYC Health + Hospitals, the largest municipal healthcare delivery system in the United States. Methods: A high-fidelity synthetic dataset (N=400) was computationally generated to safely mirror historical incident logs across 11 acute facilities without utilizing Protected Health Information (PHI). A rule-based NLP pipeline was developed in R utilizing the tidytext package. Unstructured narrative feedback was tokenized and classified into three core domains: Respect, Safety, and Communication. To validate the pipeline's accuracy, a 25% random stratified sample (n=100) was evaluated against independent, blinded manual coding performed by two reviewers, with inter-rater reliability measured via Cohen's kappa. Finally, an interactive Tableau dashboard was developed to operationalize and visualize these metrics for ongoing surveillance. Results: The NLP algorithm achieved an overall accuracy of 85.8% (95% CI: 79.0-92.6), with 81.2% sensitivity and 88.9% specificity. The highest domain-specific performance was observed in Communication (88.0% accuracy). Manual validation demonstrated strong inter-rater reliability (κ = 0.84). Operational analysis via the dashboard revealed that 61.8% of reports occurred during the Tour 2 shift (15:00 to 23:00), aligning with peak operational volume. Furthermore, Respect-related feedback was reported at a disproportionately high frequency during the Tour 3 shift (23:00 to 07:00), accounting for over 50.7% of overnight feedback submissions. Conclusion: Rule-based NLP successfully transforms qualitative healthcare feedback into structured, actionable intelligence with high specificity. Integrating this pipeline into operational dashboards transitions safety culture surveillance from a reactive, manual exercise to a proactive, scalable system, enabling targeted, data-driven interventions by hospital leadership.
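A rule-based classifier of the kind described above tokenizes the text and matches tokens against domain keyword lists. The sketch below is in Python rather than the R/tidytext stack the authors used. The keyword lists and the tie-breaking rule are hypothetical, since the abstract does not give the actual lexicon.

```python
import re
from collections import Counter

# Hypothetical keyword lists; the preprint's real lexicon is not given in the abstract.
DOMAIN_KEYWORDS = {
    "Respect": {"rude", "dismissive", "yelled", "disrespectful"},
    "Safety": {"unsafe", "error", "fall", "delay"},
    "Communication": {"handoff", "unclear", "callback", "miscommunication"},
}

def classify(text):
    """Assign the domain with the most keyword hits, or None if nothing matches."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = Counter()
    for domain, words in DOMAIN_KEYWORDS.items():
        hits[domain] = sum(token in words for token in tokens)
    # Ties resolve to the domain listed first in DOMAIN_KEYWORDS.
    domain, count = hits.most_common(1)[0]
    return domain if count > 0 else None

print(classify("The nurse was rude and dismissive during the handoff."))  # prints Respect
```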

12
Combining centralized and decentralized approaches to assess and ensure data quality in Eurocrine® via Microsoft Power BI and DataquieR

Musholt, T. J.; Clerici, T.; Bergenfelz, A.; Schmidt, C. O.; Struckmann, S.

2026-06-05 health informatics 10.64898/2026.06.04.26354884 medRxiv
Top 0.1%
9.8%

Background: Medical registries have gained importance in the evaluation of healthcare quality outcomes. In the absence of high-quality evidence, such as randomized controlled trials, studies based on registry data are essential for informing clinical guidelines. Methods for assessing data quality are rarely described in detail. To ensure the credibility of registry-based studies, registries must use all available technical and operational means to guarantee high data quality. Method: Eurocrine® is a pan-European endocrine surgical database and quality registry, initially funded by the EU healthcare programme, which started in 2015 and included more than 200,000 interventions as of April 2025. To ensure high data quality, interactive and standardized reports are created both centrally and locally via Microsoft Power BI. In addition, comprehensive data quality analyses were performed with the R package dataquieR. Results: Although a multitude of technical measures (for example, input screen design and real-time plausibility checks during data entry) are in place, they are not sufficient to prevent human errors at data entry. Errors identified in the reports were corrected, and preventive measures were implemented. Overall, the data quality was assessed as very good in terms of completeness, accuracy, and consistency. Conclusion: It is very important to provide registry users with an efficient and smart tool to identify data issues, as they have the clinical information to correct them. Data quality reports generated with dataquieR represent an effective tool for registry administrators. Predesigned Microsoft Power BI reports enable participating Eurocrine® clinics to self-audit their data.

13
Decision Curve Analysis for Evaluating Machine Learning Models for Next-Day Transfer Out of ICU

Pozo, M.; Pape, A.; Locke, B.; Pettine, W. W.

2026-04-21 health informatics 10.64898/2026.04.19.26351213 medRxiv
Top 0.1%
8.8%

Timely identification of intensive care unit (ICU) patients likely to exit the unit can support anticipatory workflows such as chart review, eligibility screening, and patient outreach prior to transfer. Most ICU discharge prediction studies report discrimination and calibration, but these metrics do not quantify the decision consequences of acting on predictions. Using adult ICU admissions from MIMIC-IV, we represented each ICU stay as a sequence of daily clinical summaries and trained logistic regression, random forest, and XGBoost models to predict next-day ICU transfer. Models achieved ROC AUC of 0.80-0.84 with differing calibration. We evaluated decision utility using decision curve analysis (DCA), where positive predictions trigger proactive review. Across thresholds, model-guided strategies outperformed review-all, review-none, and a simple clinical rule. To translate net benefit into implementable operations, we modeled a clinical trial recruitment workflow with an 8-hour daily time constraint, incorporating chart review and consent effort. At a feasible operating threshold (0.23), the model flagged ~23 charts/day and yielded ~1.23 enrollments/day under conservative eligibility and consent assumptions. These results demonstrate that DCA provides a transparent framework for determining when ICU transfer predictions are worth using and how thresholds should be selected to align with real-world workflow constraints. Data and Code Availability: This research has been conducted using data from MIMIC-IV. Researchers can request access via PhysioNet. Implementation code is available upon request.
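Decision curve analysis compares strategies by net benefit, conventionally defined as TP/N − (FP/N) × pt/(1 − pt) at threshold probability pt. The sketch below uses that standard definition with invented labels and probabilities. It is not the authors' code, which the abstract says is available on request.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of reviewing every case with predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)

def net_benefit_review_all(y_true, threshold):
    """Net benefit of the review-all strategy at the same threshold."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

# Toy labels and predicted probabilities, invented for illustration.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.05, 0.5]
for t in (0.1, 0.23, 0.5):
    model_nb = net_benefit(y_true, y_prob, t)
    all_nb = net_benefit_review_all(y_true, t)
    # Review-none always has a net benefit of 0.
    print(f"threshold={t}: model={model_nb:.3f}, review-all={all_nb:.3f}, review-none=0.000")
```

Plotting these values across a range of thresholds gives the decision curve. A model is worth using at a given threshold only where its curve lies above both the review-all and review-none lines.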

14
Impact of pharmacist board certification on health outcomes of critically ill patients: An analysis of the Optimizing Pharmacist-Team Integration for ICU patient Management (OPTIM) study

Smith, S. E.; Henry, K.; Heavner, M.; Keedy, C.; Duong, H.; Chen, Z.; Chen, X.; OPTIM Investigator Team, ; Sikora, A.

2026-06-02 intensive care and critical care medicine 10.64898/2026.05.26.26353672 medRxiv
Top 0.1%
8.7%

BACKGROUND: Critical care pharmacists (CCPs) reduce adverse drug events (ADEs) and mortality in the intensive care unit (ICU). Board certification is the established professional standard for CCPs, but its impact on ICU patient outcomes, and its relationship with CCP characteristics and workload, remains unclear. The purpose of this study was to evaluate the association between pharmacist board certification, CCP workload characteristics, and patient outcomes. METHODS: This was a pre-planned analysis of the multicenter, observational Optimizing Pharmacist-Team Integration for ICU patient Management (OPTIM) study, including adult ICU patients cared for by CCPs. Patients cared for exclusively by board-certified pharmacists on every ICU day were categorized as the BCP group; those with at least one day of care from a non-board-certified pharmacist comprised the non-BCP group. The primary outcome was hospital mortality; secondary outcomes included the hazard of discharge alive (HDA) from the ICU and hospital. Multivariable logistic regression was used to evaluate the association between BCP and mortality; Fine-Gray competing risk models were used to assess the relationship between BCP and ICU and hospital HDA. RESULTS: A total of 201 pharmacists (184 BCPs; 17 non-BCPs) from 63 institutions caring for 20,537 ICU patients were included. Care provided exclusively by a BCP (vs. ≥1 day by a non-BCP) was associated with lower mortality (OR 0.80, 95% CI 0.69 to 0.92, p=0.002) and both a higher ICU HDA (HR 1.08, 95% CI 1.03 to 1.13, p<0.001) and hospital HDA (HR 1.19, 95% CI 1.13 to 1.26, p<0.001). CONCLUSION: Daily ICU care delivered by pharmacists with board certification was independently associated with reduced mortality and improved hazard of discharge alive from the ICU. Board-certified pharmacists may enhance the quality and/or efficiency of critical care pharmacy services. These findings support the role of board certification as a modifiable factor to improve patient outcomes and optimize workload in the ICU.

15
PheBee: A Graph-Aware System for Scalable, Traceable, and Semantic Phenotyping

Gordon, D. M.; Homilius, M.; Antoniou, A. A.; Grannis, C.; Lammi, G. E.; Herman, A. C.; Kubatko, A.; Chaudhari, B. P.; White, P.

2026-05-13 health informatics 10.64898/2026.05.09.26352812 medRxiv
Top 0.1%
8.6%

Objectives: Phenotype-driven workflows in clinical and translational research require standardized ontology-based representation, ontology-aware cohort discovery, and provenance inspection for each assertion. Existing approaches optimize for either semantic traversal or scalable batch analytics, but not both. We describe PheBee, a hybrid system that links semantic assertions to scalable evidence storage via a deterministic identifier, preserving provenance while supporting ontology-aware discovery at cohort scale. Materials and Methods: PheBee represents phenotype assertions in a knowledge graph as ontology-linked nodes with clinical modifier context (e.g., negated, family history), and stores supporting evidence records in a scalable row-oriented evidence table for cohort-scale access. The two layers are connected by a deterministic identifier enabling stable joins across repeated ingestions without duplicating high-volume evidence in the graph. We evaluated PheBee using synthetic datasets designed to exercise end-to-end ingestion and query workflows. Results: Functional evaluation validated hierarchical term expansion, qualifier-aware retrieval, duplicate-free assertion handling under re-ingestion, and privacy-conscious management of subjects shared across multiple research projects. At scale (10,000 subjects producing 12M evidence records), PheBee completed ingestion in ~30 minutes and responded to interactive queries within 6 seconds under concurrent load. Discussion: PheBee exposes a unified API for ontology-aware cohort discovery with hierarchical term expansion, subject-centric retrieval of phenotypes and clinical modifiers, and evidence and provenance queries. Its data model aligns with GA4GH Phenopackets, facilitating interoperability with phenotype exchange standards. Conclusion: By combining ontology-aware semantics with scalable, provenance-bearing evidence storage, PheBee provides a practical open-source foundation for phenotype-driven research workflows that demand both semantic precision and cohort-scale traceability.
LAY SUMMARY: Researchers often use "phenotypes" (observable clinical features) to describe individual subjects and find groups of similar subjects. Those phenotypes come from many sources and need both standard terminology and clear evidence for why a phenotype has been associated with a subject. PheBee is a software system that stores phenotype assertions in a way that supports both "ontology-aware" searching (for example, finding patients with any subtype of a condition) and scalable storage of supporting evidence across large research cohorts. PheBee uses multiple types of data storage so researchers can perform interactive phenotype searches and also store millions of pieces of supporting evidence. A shared identifier connects the two storage layers, so subjects' phenotypes and their supporting evidence remain linked even as new data is added over time. We evaluated PheBee using fully synthetic (non-patient) data to confirm correct query behavior, evidence traceability, and system performance at large scale.
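
The deterministic-identifier idea described in this abstract can be illustrated with a minimal sketch. The abstract does not say how PheBee derives its identifier, so everything here is an assumption made for illustration: the function names `assertion_id` and `ingest`, the choice of SHA-256, the field layout, and the in-memory `graph` and `evidence` stand-ins for the two storage layers.

```python
import hashlib

def assertion_id(subject_id, term_id, qualifiers=()):
    """Hypothetical deterministic key: the same inputs always hash to the same id."""
    # Sort qualifiers so that their input order does not change the key.
    canonical = "|".join([subject_id, term_id, ",".join(sorted(qualifiers))])
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

graph = {}     # stand-in for the knowledge graph: one entry per assertion
evidence = []  # stand-in for the row-oriented evidence table

def ingest(subject_id, term_id, qualifiers, note_id):
    """Upsert the assertion and append one evidence row that points at it."""
    key = assertion_id(subject_id, term_id, qualifiers)
    graph[key] = {"subject": subject_id, "term": term_id,
                  "qualifiers": sorted(qualifiers)}
    evidence.append({"assertion_id": key, "note_id": note_id})
    return key

# Ingesting the same assertion twice, backed by two different notes.
first = ingest("subj-1", "HP:0001250", ["negated"], "note-a")
second = ingest("subj-1", "HP:0001250", ["negated"], "note-b")
```

In this sketch both calls return the same key, so the graph stand-in holds a single assertion while the evidence list holds two rows that can be joined back to it.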

16
Frontier Large Language Models for Comprehensive Medication Review in CKD Patients with Polypharmacy: A Trap-Embedded Synthetic Benchmark

Chuang, K.-C.; Lin, H.-J.; Lin, H.-M.

2026-05-26 health informatics 10.64898/2026.05.23.26353939 medRxiv
Top 0.1%
8.4%

Background: Patients with CKD and polypharmacy face high rates of drug-related problems, yet comprehensive medication review remains time-intensive and inconsistently performed. Large language models (LLMs) may augment this process, but existing benchmarks use multiple-choice formats that do not reflect open-ended, nephrology-specific review. We developed a trap-embedded synthetic CKD benchmark and evaluated five current-generation LLMs (GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, Grok 4.1 Fast, DeepSeek R1; tested April-May 2026) for open-ended medication review. Methods: Fifty synthetic CKD cases across three complexity groups (G3a-G3b [n=20], G4 [n=15], G5/G5D/transplant [n=15]) with 8-12 medications and ≥2 embedded clinical traps each were scored against nephrologist-adjudicated gold standards. Each model produced three independent responses per case (temperature 0; 750 total outputs). Primary endpoint was per-case macro F1; secondary endpoints were safety-critical omission rate, PI-adjudicated hallucination rate, and intra-model consistency. Blinded inter-rater reliability for gold-standard item detection was assessed on a 30% sample. Results: Consensus-level macro F1 ranged from 0.41 (Claude Sonnet 4.6) to 0.49 (Grok 4.1 Fast) (Friedman P < 0.001). Phosphate binder timing (11%) and hyperkalemia combinations (33%) were poorly detected across all models. Safety-critical omission rate ranged from 22% to 48% (P < 0.001); PI-adjudicated hallucination ranged from 0% (GPT-5.4) to 54% (DeepSeek R1), including fabricated dose caps and non-existent guideline citations. Blinded reliability for gold-standard item detection was high (kappa = 0.934, n = 92). Conclusions: This nephrology-specific benchmark exposes clinically important LLM blind spots that generic multiple-choice evaluations would not detect. Heterogeneous hallucination and omission rates indicate that model selection and domain-specific guardrails should precede any clinical deployment of LLM-assisted CKD medication review. Prospective validation with real patient data and human comparators is required before deployment recommendations can be made.

17
A Three-Tier Operational Benchmark for Evaluating Large Language Models on Hospital Medication Safety

Proulx, J.; Daines, B.; Barton, M.; Leonard, M. E.; Garcia, J. A.; Young, B.; Snell, Q.; West, T. W.; Watson, S. R.; AlQaseer, M.; Louiset, M.; Maqsood, M. B.; Voutt-Goos, M. J.; Douma, C.; Kasbekar, N.; Jeffries, J.; Abu-Rahmeh, W.; Frush, K.; Grewal, D. K.; Bahsoun, M.; Leonard, M.; Frankel, A.; Classen, D. C.; Pestotnik, S. L.

2026-06-10 health informatics 10.64898/2026.06.05.26354271 medRxiv
Top 0.1%
8.4%

Objective. To introduce PsiBench, a clinically validated medication-safety benchmark for evaluating large language models (LLMs) against the standards used to certify hospital computerized provider order entry (CPOE) and electronic health record (EHR) systems, and a non-overlapping three-tier evaluation framework separating highest-stakes discrimination, the operational CDS regime, and category-correct alerting. Materials and Methods. PsiBench comprises 492 medication-safety scenarios across 11 safety categories, created by clinical pharmacology experts whose work underpins an annualized testing procedure used by more than 2,000 U.S. hospitals. The three-tier framework partitions the scenarios without overlap: Discrimination (98 scenarios, 50 fatal vs 48 deception, near-balanced 51%/49%); Operational (394 scenarios, 261 serious unsafe plus 133 safe including 41 Excessive Alerts reclassified as operational negatives); and Attribution (311 alert-required scenarios). We evaluated 40 frontier LLMs from 10 providers over 3 runs per scenario at temperature 0.2 (or the provider default where temperature is not configurable), yielding 59,040 evaluations conducted April 21-23, 2026. Results. Headline binary performance on the full benchmark spans a wide range across the 40 models: F1 78.5%-92.3%, accuracy 65.4%-89.8%, sensitivity 81.4%-100.0%, specificity 6.1%-81.8%. Leading models by F1 (o4-mini 92.3%; o3 92.2%) pair high sensitivity with meaningful specificity; three models saturate sensitivity at 100% but fall below 25% specificity, indistinguishable from a naive always-alert classifier. The wide spread on a single headline metric motivates tier-specific analyses, developed in a separate clinical paper. Discussion and Conclusion. PsiBench and the three-tier framework operationalize a rigorous evaluation rubric for LLM medication safety, grounded in two decades of national hospital audit experience. The framework generalizes to any binary medication-safety classifier (rule-based, conventional ML, or LLM-driven), supporting tier-aware model selection and post-deployment surveillance.
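
The arithmetic behind the evaluation count, and the reason an always-alert classifier is a useful reference point, can be checked with a short sketch. The `Confusion` class and its field names are hypothetical and not part of PsiBench. The always-alert baseline is applied to the Operational tier counts quoted in the abstract (261 unsafe, 133 safe) purely for illustration, whereas the headline ranges in the abstract refer to the full benchmark.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Confusion:
    """Hypothetical container for binary confusion-matrix counts."""
    tp: int  # unsafe scenario, alert raised
    fp: int  # safe scenario, alert raised
    fn: int  # unsafe scenario, no alert
    tn: int  # safe scenario, no alert

    @property
    def sensitivity(self):
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self):
        return self.tn / (self.tn + self.fp)

    @property
    def accuracy(self):
        return (self.tp + self.tn) / (self.tp + self.fp + self.fn + self.tn)

    @property
    def f1(self):
        return 2 * self.tp / (2 * self.tp + self.fp + self.fn)

# 40 models x 492 scenarios x 3 runs, as described in the abstract.
total_evaluations = 40 * 492 * 3

# A classifier that alerts on every Operational-tier scenario:
# all 261 unsafe scenarios are caught, all 133 safe ones are false alarms.
always_alert = Confusion(tp=261, fp=133, fn=0, tn=0)
```

Under these counts the always-alert baseline reaches perfect sensitivity and zero specificity while still posting an F1 of roughly 0.80, which is why high sensitivity alone does not separate a useful model from one that alerts indiscriminately.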

18
Enhanced Adverse-Event Detection and Drug-Event Relation Extraction from Clinical Notes

Alharbi, O.; Wu, C. H.; Chen, C.; Shanker, V.

2026-05-08 health informatics 10.64898/2026.05.06.26352616 medRxiv
Top 0.1%
8.4%

Adverse drug events (ADEs) are a significant source of preventable patient harm, yet many ADE signals remain buried in free-text clinical notes. Clinical notes often describe adverse events (AEs) in relation to drugs in two ways: whether a drug causes the AE (the AE is an ADE) or a drug is given to treat an AE (it is considered the Reason for drug treatment). In the N2C2 2018 benchmark, ADEs and Reasons are annotated as separate entity types, despite often being similar in both wording and clinical meaning. This shared similarity makes them difficult to distinguish during entity extraction, leading to errors in relation classification. Therefore, we propose a two-stage framework that first detects AEs as a unified event category and then classifies drug-event pairs into Drug-ADE, Drug-Reason, or No-Relation. In the end-to-end evaluation on the N2C2 2018 benchmark, our system achieves F1 scores of 0.93 for Drug-ADE and 0.94 for Drug-Reason, improving over previously reported end-to-end benchmarks of 0.48 for Drug-ADE and 0.59 for Drug-Reason. Overall, these results support a more precise task formulation in which AEs are detected broadly first, and the ADE vs Reason distinction is resolved at the relation layer. Furthermore, they motivate the development of AE-focused datasets annotated independently of drug linkage to enable more reliable end-to-end pharmacovigilance systems.

19
From Carb Counting to Diagnosis: Real World Patient Uses and Attitudes Toward Large Language Models in Diabetes Management

Nkweteyim, R. N.; Shet, V. G.; Iregbu, S.; He, L.

2026-03-19 health informatics 10.64898/2026.03.10.26348079 medRxiv
Top 0.1%
8.2%
Show abstract

Managing diabetes-related conditions is time-intensive and cognitively demanding for patients and caregivers, requiring ongoing glucose monitoring, dietary regulation, physical activity planning, and continuous lifestyle adaptation. With the emergence of large language models (LLMs), patients have increasingly turned to these tools for information, guidance, and support. However, there is limited empirical understanding of which diabetes-related medical tasks patients delegate to LLMs and what their experiences are. To address this gap, we combined qualitative thematic analysis with LLM-assisted analysis to examine patient attitudes toward, and real-world uses of, LLMs for diabetes-related tasks. Our analysis identified diverse application areas, ranging from clinical interpretation to nutrition and diet support and disease management, among others. LLMs functioned not only as information sources, but also as interpretive, analytical, decision-support, emotional, and logistical aids supporting patients' self-management. Finally, we discuss implications for integrating LLMs into patients' self-management support ecosystems and identify areas that require support and safeguards.

20
Electronic health record implementation: how to reduce the possible negative impacts

Calderon, P. F.; Wolosker, N.

2026-03-25 health informatics 10.64898/2026.03.24.26347438 medRxiv
Top 0.1%
8.2%

Objective: Develop a methodology to implement action plans that mitigate the negative impacts associated with the EHR implementation project and evaluate their effectiveness in reducing these issues. Methods: The research involved the development of mitigation plans for the potential negative impacts of implementing an electronic health record system, ensuring their execution and subsequently analyzing the effectiveness of the method. Results: Findings confirmed that 19.3% of 264 identified impacts were resolved through 52 plans before Go Live. During Go Live, the remaining 213 impacts were addressed through 337 plans. Six months later, 190 impacts were confirmed, and the plans were considered effective or partially effective in 80.5% of cases. Conclusions: Effective governance, a multidisciplinary methodology, and well-planned and executed actions increase the likelihood of success for health technology projects.